Machine Learning Overview

Prof.ssa Ileana Baldi
UBEP

Course Outline: Week 2


Machine learning overview

  1. Performance assessment
    • model evaluation
    • model selection
  2. Resampling methods
    • Cross-validation
    • Bootstrap

Performance assessment

Loss functions for numerical output:

\[ L(Y,\hat{f}(X))=\begin{cases} |Y-\hat{f}(X)| \\ (Y-\hat{f}(X))^2 \end{cases} \]

Loss functions for categorical output: \[ L(Y,\hat{f}(X))=\begin{cases} I(Y \ne \hat{f}(X)) \\ -2\log \hat{p}_k(X) \end{cases} \]
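As an illustration, the four losses above can be computed directly. The Python sketch below uses made-up outcomes and predictions; all values and variable names are hypothetical, chosen only to make the formulas concrete.

```python
import numpy as np

# Hypothetical outcomes and predictions, just to illustrate the losses.
y = np.array([3.0, -1.0, 2.5])        # numerical outcome Y
y_hat = np.array([2.5, -0.5, 3.0])    # prediction f_hat(X)

abs_loss = np.abs(y - y_hat)          # L = |Y - f_hat(X)|
sq_loss = (y - y_hat) ** 2            # L = (Y - f_hat(X))^2

y_cls = np.array([0, 1, 1])           # categorical outcome
y_cls_hat = np.array([0, 1, 0])       # predicted class
p_true = np.array([0.8, 0.7, 0.4])    # estimated probability of the true class

zero_one = (y_cls != y_cls_hat).astype(float)  # L = I(Y != f_hat(X))
deviance = -2 * np.log(p_true)                 # L = -2 log p_hat_k(X)
```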

Test vs training error rate

  • Training error rate is the average loss incurred when a statistical learning method is used to predict the very observations that were used to fit the model in the first place:

\[\frac 1 n \sum_{i=1}^n L(y_i,\hat{f}(x_i))\]

  • Test error rate is the average error that results from using a statistical learning method to predict the response on a new observation:

\[E\left[L(Y,\hat{f}(X))\right]\]
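To see the gap between the two quantities concretely, here is a minimal Python sketch on simulated data (the data-generating process and the degree-9 fit are assumptions made only for this illustration): an overly flexible fit makes the training error much smaller than the error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    # hypothetical data-generating process: y = sin(x) + noise
    x = rng.uniform(-2, 2, n)
    return x, np.sin(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = simulate(50)
x_test, y_test = simulate(1000)   # "new observations" from the same process

# A flexible degree-9 polynomial fit by least squares.
coefs = np.polyfit(x_train, y_train, deg=9)

train_err = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
test_err = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
# Typically train_err < test_err: the training error is optimistic.
```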


Test vs training error rate

  • We use the test error rate to measure model performance, since the training error rate can dramatically underestimate the error on new data and thus overstate real-world performance.

(Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. Springer, 2009)

Test error

There are two common approaches to estimate test error:

  1. We can directly estimate the test error, using either a validation set approach or a cross-validation approach.

  2. We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting (e.g. AIC, BIC,…)
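As a sketch of the second, indirect approach: for a Gaussian linear model fit by least squares, AIC can be written (up to an additive constant) as \(n\log(\mathrm{RSS}/n) + 2d\), where \(d\) is the number of fitted parameters. The Python below applies this to polynomial fits; the simulated data and the candidate degrees are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 80)
y = np.sin(x) + rng.normal(scale=0.3, size=80)  # hypothetical data

def aic(y, y_hat, d):
    # AIC for a Gaussian model, up to an additive constant:
    # n*log(RSS/n), penalised by 2*d for the optimism of the training error
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * d

# Compare candidate models on the training data alone.
for deg in (1, 3, 9):
    coefs = np.polyfit(x, y, deg=deg)
    print(deg, aic(y, np.polyval(coefs, x), d=deg + 1))
```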

Goals

  • Model selection: estimating the performance of different models in order to choose the best one.
  • Model assessment: having chosen a final model, estimating its test error on new data.

Solutions

In a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts:

  • a training set (usually \(\sim 50\%n\))
    • to fit the model
  • a validation set (\(\sim 25\%n\))
    • to estimate prediction error for model selection
  • a test set (\(\sim 25\%n\))
    • for assessment of the test error of the final chosen model
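A minimal sketch of this 50/25/25 split in Python (the sample size and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200                       # hypothetical sample size
idx = rng.permutation(n)      # shuffle the observation indices

n_train, n_val = n // 2, n // 4
train_idx = idx[:n_train]                 # ~50%: fit the models
val_idx = idx[n_train:n_train + n_val]    # ~25%: model selection
test_idx = idx[n_train + n_val:]          # ~25%: final assessment
```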

Solutions


  • Sometimes, simply splitting the data into training and testing datasets is not recommended due to the small size of the sample…
  • Here we instead consider a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations

From validation to cross-validation (CV)

  • Training an algorithm and evaluating its statistical performance on the same data yields an overoptimistic result.
  • CV was introduced to fix this issue, starting from the remark that testing the output of the algorithm on new data would yield a good estimate of its performance.
  • In most real applications, only a limited amount of data is available, which leads to the idea of splitting the data.

Data splitting or Hold out

  • The most basic validation set approach
  • This method randomly divides the available data into two parts
    • the training sample is used for training the algorithm
    • the validation sample is used for evaluating the performance of the algorithm
    • the validation sample can play the role of “new data” as long as the data are i.i.d.
  • A single data split yields a validation estimate of the risk, and averaging over several splits yields a cross-validation estimate.
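The sketch below makes this concrete: a simple 1-nearest-neighbour predictor on simulated data, one random split giving a validation estimate of the risk, and an average over repeated splits. Everything here (the data, the 70/30 split, the number of repetitions) is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: y = sin(x) + noise.
x = rng.uniform(-2, 2, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=100)

def knn1_predict(x_tr, y_tr, x_new):
    # 1-nearest-neighbour: predict with the closest training point
    return y_tr[np.argmin(np.abs(x_tr[:, None] - x_new[None, :]), axis=0)]

def holdout_risk(seed):
    # one random 70/30 training/validation split
    idx = np.random.default_rng(seed).permutation(len(x))
    tr, va = idx[:70], idx[70:]
    y_hat = knn1_predict(x[tr], y[tr], x[va])
    return np.mean((y[va] - y_hat) ** 2)   # validation estimate of the risk

single_split = holdout_risk(0)                           # one split
averaged = np.mean([holdout_risk(s) for s in range(20)]) # average over splits
```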

Example of ICAM predictions (through KNN)

Figure: left panel shows a single split; right panel shows multiple splits.

Data splitting or Hold out - advantages

  • The validation approach is simple and easy to implement
  • It allows hypotheses to be confirmed in the test sample.

Data splitting or Hold out - drawbacks

  • It reduces the sample size for both model development and model testing.
  • It requires a larger sample to be held out than cross-validation does, in order to obtain the same precision in the estimate of predictive accuracy.
  • A single split may be fortuitous, i.e., favourable or unfavourable purely by chance.
  • Data-splitting does not validate the final model, but rather a model developed on only a subset of the data. The training and test sets are recombined for fitting the final model, which is not validated.
  • Data-splitting requires the split before the first analysis of the data.

Resampling methods

  • Underlying idea:
    • analyses can proceed in the usual way on the complete dataset
    • after a “final” model is specified, the modeling process is rerun on multiple resamples from the original data to mimic the process that produced the “final” model.
    • by repeatedly and randomly sampling the data, we obtain a more robust measure of fit

Resampling methods

  • Resampling methods have become an integral tool in modern statistics.
  • Resampling is based on repeatedly drawing samples from a training set of observations and refitting a model on each sample in order to obtain additional insights into that model.
  • The goal of any resampling method is to better estimate how a statistical model will perform on out-of-sample, ‘real-life’ data.

Resampling methods

  • Cross-Validation
  • Bootstrap

Cross-validation

  • CV estimates the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical method to those held out observations.
  • This process can be repeated several times so that more of the data takes part in training the model, one held-out subset at a time.
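A minimal k-fold cross-validation sketch in Python (the simulated data, the choice of \(k = 5\), and the degree-3 polynomial model are illustrative assumptions, not part of the course material):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=100)  # hypothetical data

k = 5
folds = np.array_split(rng.permutation(len(x)), k)  # k disjoint subsets

fold_errors = []
for held_out in folds:
    train = np.setdiff1d(np.arange(len(x)), held_out)
    coefs = np.polyfit(x[train], y[train], deg=3)   # fit without the held-out fold
    y_hat = np.polyval(coefs, x[held_out])          # predict on the held-out fold
    fold_errors.append(np.mean((y[held_out] - y_hat) ** 2))

cv_estimate = np.mean(fold_errors)  # CV estimate of the test error
```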